Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parseFloatThousandSep #15421

Closed
wants to merge 86 commits into from
Closed

Add parseFloatThousandSep #15421

wants to merge 86 commits into from

Conversation

juancarlospaco
Copy link
Collaborator

@juancarlospaco juancarlospaco commented Sep 28, 2020

  • Convenience func for parseFloat designed to parse floats with thousand separators as found in the wild formatted for humans.
  • Documentation, runnableExamples with doAssert, since, changelog, etc.
doAssert parseFloatThousandSep("1,000,000.000") == 1000000.0
doAssert parseFloatThousandSep("1'000'000.000", thousandSep = '\'') == 1000000.0

echo parseFloat("1,000,000.000")
  Error: unhandled exception: invalid float: 1,000,000.000 [ValueError]
echo parseFloat("1'000'000.000")
  Error: unhandled exception: invalid float: 1'000'000.000 [ValueError]

:)

links

lib/pure/strmisc.nim Outdated Show resolved Hide resolved
@Vindaar
Copy link
Contributor

Vindaar commented Sep 29, 2020

I think this might be a good addition.

However, if we are to have something like this I would trade correctness for speed. Perform a first pass over the string and make explicit checks, namely:

  • no separator before a digit
  • first separator can be anywhere after first digit
  • there has to be 3 digits between successive separators and between the last separator and the decimal dot
  • no separator after decimal dot
  • ...? (maybe I forgot something else)

Have that pass over the string remove the separators and hand the result to normal parseFloat.

@juancarlospaco
Copy link
Collaborator Author

juancarlospaco commented Sep 29, 2020

@Vindaar
Have any pseudocode, code or github suggestion on how you imagine it?, your comment makes sense.

@juancarlospaco
Copy link
Collaborator Author

This is for strings as formatted for humans with quirky punctuation,
sometimes used for money, coords, big floats on sciences, etc etc.

Thats why I named differently, because does different things than parseFloat.
Can be made stricter and have more options yeah...

@juancarlospaco juancarlospaco marked this pull request as draft September 29, 2020 10:25
@Vindaar
Copy link
Contributor

Vindaar commented Sep 29, 2020

@juancarlospaco I'll write a couple of lines later if I don't forget (ping me otherwise). Not a parsing expert though. :P

@Vindaar
Copy link
Contributor

Vindaar commented Sep 29, 2020

Probably neither the smartest, nor most efficient way, but this is what I came up with:

edit: haha, I just noticed I forgot about both minus and exp notation. The latter isn't important (who in their right mind would combine the two, but the former kinda is). If you think this is useful, I can add it. But won't do it now. Doesn't change the idea / approach.

import parseutils, strutils

proc parseFloatThousandSep(s: string,
                           sep: static char = ',',
                           decimalDot: static char = '.'): float = # maybe should not have a default sep?
  ## version of `parseFloat` which allows for thousand separators.
  ## The following assumptions / requirements must be met by the string
  ## - no separator before a digit
  ## - first separator can be anywhere after first digit, but no more than 3 characters
  ## - there ``has`` to be 3 digits between successive separators
  ##   - and between the last separator and the decimal dot
  ## - no separator after decimal dot
  ## - no duplicate separators
  ## - floats without separator allowed
  var
    buf = s
    idx = 0
    successive = 0
    afterDot = false
    lastWasDot = false
    lastWasSep = false
    hasAnySep = false
  template bail(msg: untyped): untyped =
    raise newException(ValueError, "Invalid float containing thousand separator." &
      " Reason: " & $msg)

  while idx < buf.len:
    case buf[idx]
    of sep:
      if idx == 0:
        bail("String starts with thousand separator.")
      elif lastWasSep:
        bail("Two separators in a row.")
      elif afterDot:
        bail("Separator found after decimal dot.")
      buf.delete(idx, idx)
      lastWasSep = true
      hasAnySep = true
      successive = 0
    of '0' .. '9':
      if hasAnySep and successive > 2:
        bail("More than 3 digits between thousand separators.")
      lastWasSep = false
      lastWasDot = false
      inc successive
      inc idx
    of decimalDot:
      if idx == 0:
        bail("String starts with decimal dot.")
      elif hasAnySep and successive != 3:
        bail("Not 3 successive digits before decimal point, despite larger 1000!")
      successive = 0
      lastWasDot = true
      afterDot = true
      inc idx
    else:
      # NOTE: could also move separator logic here if we wanted runtime separator
      # selection. Case needs CT info
      bail("Invalid character in float: " & $buf[idx])
  result = buf.parseFloat


doAssert parseFloatThousandSep("1.0") == 1.0
doAssert parseFloatThousandSep("1.000") == 1.0
doAssert parseFloatThousandSep("1,000") == 1000.0
doAssertRaises(ValueError):
  # invalid because , not the sep
  discard parseFloatThousandSep("1,000", sep = '\'')
# compile time error due to duplicate case label
# parseFloatThousandSep("1,000", sep = '.')
doAssert parseFloatThousandSep("10,000.000") == 10000.0
doAssertRaises(ValueError):
  # thousand sep after decimal dot
  discard parseFloatThousandSep("10.000,000")
doAssert parseFloatThousandSep("1,000,000.000") == 1000000.0
doAssert parseFloatThousandSep("10,000,000.000") == 10000000.0
doAssertRaises(ValueError):
  # starts with sep
  discard parseFloatThousandSep(",123.000")
doAssertRaises(ValueError):
  # starts with decimal dot
  discard parseFloatThousandSep(".000")
doAssertRaises(ValueError):
  # duplicate thousand sep
  discard parseFloatThousandSep("123,,100.0")
doAssertRaises(ValueError):
  # sep before dot
  discard parseFloatThousandSep("123,.0")

Up to someone else to decide if the separator should be compile time, but this made the code a lot nicer imo. Feel free to take it or leave it. :)

@juancarlospaco juancarlospaco marked this pull request as ready for review September 29, 2020 22:43
@juancarlospaco juancarlospaco requested a review from Araq September 29, 2020 22:43
lib/pure/strmisc.nim Outdated Show resolved Hide resolved
lib/pure/strmisc.nim Outdated Show resolved Hide resolved
@dom96
Copy link
Contributor

dom96 commented Oct 3, 2020

Cool, can we also get the opposite of this? :)

lib/pure/strmisc.nim Outdated Show resolved Hide resolved
lib/pure/strmisc.nim Outdated Show resolved Hide resolved
lib/pure/strmisc.nim Outdated Show resolved Hide resolved
lib/pure/strmisc.nim Outdated Show resolved Hide resolved
lib/pure/strmisc.nim Outdated Show resolved Hide resolved
@juancarlospaco juancarlospaco requested a review from Araq October 6, 2020 03:27
lib/pure/strmisc.nim Outdated Show resolved Hide resolved
@juancarlospaco juancarlospaco marked this pull request as ready for review November 24, 2020 20:35
doAssert parseFloatThousandSep("-Inf", {pfNanInf}) == -Inf
doAssert parseFloatThousandSep("+Inf", {pfNanInf}) == +Inf
doAssert parseFloatThousandSep("1000.000000E+90") == 1e93
doAssert parseFloatThousandSep("-10 000 000 000.0001", sep=' ') == -10000000000.0001
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseFloatThousandSep("1e1") raises but should be accepted even without {pfDotOptional} because:

  • a user may not want to set {pfDotOptional} (too loose, eg would allow integers to be parsed as float)
  • nim: strutils.parseFloat accepts it
  • D: accepts it too (rdmd --eval 'writeln("1e1".to!double);')
  • python3: accepts it (float("1e1"))

@timotheecour
Copy link
Member

timotheecour commented Dec 2, 2020

@juancarlospaco I think https://github.com/nim-lang/Nim/pull/15421/files#r534562246 is my last comment, after that LGTM finally... unless i've missed some other case

EDIT: just found another bug: https://github.com/nim-lang/Nim/pull/15421/files#r534567618

please run sanity checks (or rather increase test coverage) before the next PTAL

doAssert parseFloatThousandSep("1000.000000E+90") == 1e93
doAssert parseFloatThousandSep("-10 000 000 000.0001", sep=' ') == -10000000000.0001
doAssert parseFloatThousandSep("-10 000 000 000,0001", sep=' ', decimalDot = ',') == -10000000000.0001
doAssert classify(parseFloatThousandSep("NaN", {pfNanInf})) == fcNan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug, this shouldn't be accepted:

nim> echo parseFloatThousandSep("inf.0", {pfNanInf})
0.0

@@ -0,0 +1,59 @@
import strmisc, math
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please move the content of tests/stdlib/tstrmiscs.nim into here (tstrmiscs.nim was badly named)


proc parseFloatThousandSepRaise(i: int; c: char; s: openArray[char]) {.noinline, noreturn.} =
raise newException(ValueError,
"Invalid float containing thousand separators, invalid char $1 at index $2 for input $3" %
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

      raise newException(ValueError,
        "Invalid float containing thousand separators, invalid char $1 at index $2 for input '$3'" % [$c, $i, s.join])

so that it shows as

'1,0000'

instead of

['1', ',', '0', '0', '0', '0'] 

parseFloatThousandSepRaise(0, sep, str) # "1,1"

if (strLen == 3 or strLen == 4) and (
(str[0] in {'i', 'I'} and str[1] in {'n', 'N'} and str[2] in {'f', 'F'}) or
Copy link
Member

@timotheecour timotheecour Dec 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cna you simplify this logic? it's buggy for 2 reasons:

nim> echo parseFloatThousandSep("infx", {pfNanInf})
Error: unhandled exception: invalid float: infx [ValueError]
nim> echo parseFloatThousandSep("infxx", {pfNanInf})
Error: unhandled exception: Invalid float containing thousand separators, invalid char , at index 0 for input 'infxx' [ValueError]

@juancarlospaco juancarlospaco marked this pull request as draft December 3, 2020 02:15
pfNanInf ## Allow "NaN", "Inf", "-Inf", etc.

func parseFloatThousandSep*(str: openArray[char]; options: set[ParseFloatOptions] = {};
sep = ','; decimalDot = '.'): float =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be discussed in future work, but just curious if there's a reverse proc for this?
I just found about insertSep, maybe they should cross-reference each other?
I'm not sure how robust insertSep is though; there's also formatFloat and formatEng, maybe they should be extended to support formatting with thousand separators

@Araq
Copy link
Member

Araq commented Dec 14, 2020

Sorry, rejected. For multiple reasons:

  • By design it cannot handle money which shouldn't be stored as a "float".
  • The openArray[char] interface is alien and not good enough for general parsing. (Hint: It doesn't return how many characters it returned so you're still left with a preprocessing step. Even if the string comes from an input field, you generally want to accept leading or trailing whitespace. So that would be yet another option pfAllowWhitespace)
  • The code is harder to use than the more hacky s.multiReplace({".": "", ",": ".").parseFloat (which would be quite appropriate for German btw).
  • The amount of options you can pass to the proc indicates that nobody really knows the use cases or that "every" use case should be covered. And yet, "money" is not among these.

@Araq Araq closed this Dec 14, 2020
@timotheecour
Copy link
Member

timotheecour commented Dec 16, 2020

/cc @Araq

For multiple reasons:

these are good arguments; how about the following instead:

proc parseIntCustom(a: string, start: int, T: typedesc[SomeInt], options: ParseOpt = {}): tuple[val: T, num: int] =
  ## val: parsed value; num: number of characters parsed on success or 0 on error
  runnableExamples:
    assert parseIntCustom("score: -1,234,567 name: foo", start = 5, int) == (-1,234,567, 10)
    assert parseIntCustom("--1", start = 0, int) == (0, 0) # because --
    assert parseIntCustom("-1", start = 0, uint) == (0, 0) # because uint

By design it cannot handle money which shouldn't be stored as a "float".

=> an API for that can be built on top of parseIntCustom to parse the integral part of the amount

The openArray[char] interface is alien and not good enough for general parsing

proposed API fixes that

The code is harder to use than the more hacky

but the more hacky way doesn't help with input validation

The amount of options you can pass to the proc indicates

I think parseIntCustom hits a sweet spot between complexity and usefulness; by restricting it to just integers (signed/unsigned), you avoid complexity of handling FP numbers; note that parsing FP numbers with thousand sep can use parseIntCustom as a building block, so can parsing of money.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants